Mapping of Sequence Reads to the Reference Genomes ◾ 63
2.3 READ SEQUENCE ALIGNMENT AND ALIGNERS
In read mapping, we typically have millions of reads produced by a high-throughput
instrument, and we wish to determine where in the reference genome sequence each read
originated. For most sequencing applications, read alignment (mapping) is the slowest
and most computationally expensive step, because the mapping program must determine
the most likely point of origin of every read with respect to the reference genome.
Mapping reads produced by eukaryotic RNA-Seq requires extra effort from aligners. In
eukaryotes, the coding regions (exons) of genes are separated by non-coding regions
(introns). Since RNA sequencing targets only the transcriptome (the gene transcripts),
from which introns have been spliced out, a read may span two exons that are not
adjacent in the genome. Aligners used for mapping RNA-Seq reads must therefore be aware
of the non-contiguous nature of the exons and must meet the challenge of detecting
splice junctions.
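To make the splicing problem concrete, the following minimal sketch builds a toy genome with two exons and one intron, then shows that a read spanning the exon-exon junction has no contiguous match but can be placed once it is split in two. The sequences and the function names (find_exact, find_spliced) are hypothetical and chosen only for illustration. The brute-force search stands in for the indexed seed search and scoring that a real splice-aware aligner would use.

```python
from typing import List, Tuple


def find_exact(reference: str, read: str) -> List[int]:
    """Return every 0-based position where `read` occurs contiguously."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


def find_spliced(reference: str, read: str,
                 min_anchor: int = 3) -> List[Tuple[int, int, int]]:
    """Brute-force split alignment (illustrative only).

    Tries every cut of `read` into a left and a right part, each at least
    `min_anchor` bases long, and keeps the cuts where both parts match the
    reference exactly with a gap (a candidate intron) between them.
    Returns (left_start, left_length, right_start) tuples.
    """
    results = []
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:cut], read[cut:]
        for l_pos in find_exact(reference, left):
            left_end = l_pos + cut
            for r_pos in find_exact(reference, right):
                if r_pos > left_end:  # right part lies downstream of a gap
                    results.append((l_pos, cut, r_pos))
    return results


# Toy data (assumed, not taken from any real genome).
exon1 = "ACGTTGCA"
intron = "GTAAGTCCCTTTAG"
exon2 = "TCGGATCC"
genome = exon1 + intron + exon2      # what the reference contains
transcript = exon1 + exon2           # what RNA-Seq samples after splicing
read = transcript[4:12]              # 4 bases from exon1 + 4 from exon2

print("contiguous hits:", find_exact(genome, read))
for l_pos, l_len, r_pos in find_spliced(genome, read):
    gap = genome[l_pos + l_len:r_pos]
    print(f"left part at {l_pos}, right part at {r_pos}, skipped: {gap}")
```

In this toy case the skipped segment is exactly the intron, which is the kind of gap a splice-aware aligner has to infer. Note that exhaustive splitting like this would be far too slow for millions of reads against a full genome, which is one reason indexing matters.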
Before performing alignment, we need to download the reference genome sequence of the
species under study and then index it so that the locations where reads map can be found
quickly during searching and alignment. For most aligners, indexing the reference
genome is the first step before read mapping. Above, we discussed the most commonly
used indexing methods for storing and organizing the reference genome sequence so that
it can be searched efficiently to determine the locations (coordinates) of aligned reads.
Aligners face another challenge: the reads produced by sequencers may not align exactly
to any location in the reference genome sequence, either because of base-call errors or
because of genuine mutations (substitutions, deletions, or insertions) in the DNA
sequences of the individual
FIGURE 2.12 (a) BWT, (b) rank table, and (c) lookup table.